 Interfaces


Avalon: A Benchmark for RL Generalization Using Procedurally Generated Worlds

Neural Information Processing Systems

Despite impressive successes, deep reinforcement learning (RL) systems still fall short of human performance on generalization to new tasks and environments that differ from their training. As a benchmark tailored for studying RL generalization, we introduce Avalon, a set of tasks in which embodied agents in highly diverse procedural 3D worlds must survive by navigating terrain, hunting or gathering food, and avoiding hazards. Avalon is unique among existing RL benchmarks in that the reward function, world dynamics, and action space are the same for every task, with tasks differentiated solely by altering the environment; its 20 tasks, ranging in complexity from eat and throw to hunt and navigate, each create worlds in which the agent must perform specific skills in order to survive. This setup enables investigations of generalization within tasks, between tasks, and to compositional tasks that require combining skills learned from previous tasks. Avalon includes a highly efficient simulator, a library of baselines, and a benchmark with scoring metrics evaluated against hundreds of hours of human performance, all of which are open-source and publicly available. We find that standard RL baselines make progress on most tasks but are still far from human performance, suggesting Avalon is challenging enough to advance the quest for generalizable RL.
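
To make the shared-interface idea concrete, here is a minimal Python sketch of an environment in the spirit described above, where every task exposes the same action space and survival reward and only the procedural world generation differs. The class `AvalonLikeEnv`, its methods, and the task names are hypothetical illustrations, not Avalon's actual API.

```python
# Illustrative sketch only: the class and method names below are hypothetical
# and do not reflect Avalon's actual API. It shows the structural idea that
# every task shares one action space and one survival reward, and tasks
# differ only in how the world is procedurally generated.
import random


class AvalonLikeEnv:
    ACTIONS = ["move", "turn", "jump", "grab", "throw", "eat"]  # shared across tasks

    def __init__(self, task: str, seed: int):
        self.task = task
        self.rng = random.Random(seed)
        self.world = self._generate_world()

    def _generate_world(self):
        # Task-specific procedural generation: only this part changes per task.
        return {"terrain_roughness": self.rng.random(),
                "food_items": self.rng.randint(1, 5),
                "hazards": self.rng.randint(0, 3),
                "task": self.task}

    def step(self, action: str):
        # Reward is the same survival signal for every task (here: a stub).
        survived = action in self.ACTIONS and self.rng.random() > 0.05
        reward = 1.0 if survived else 0.0
        done = not survived
        return self.world, reward, done


# Generalization is then probed by training on some seeds/tasks and
# evaluating on held-out ones.
env = AvalonLikeEnv(task="hunt", seed=123)
obs, reward, done = env.step("move")
```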


emg2pose: A Large and Diverse Benchmark for Surface Electromyographic Hand Pose Estimation
Sasha Salter, Richard Warren

Neural Information Processing Systems

Hands are the primary means through which humans interact with the world. Reliable and always-available hand pose inference could yield new and intuitive control schemes for human-computer interactions, particularly in virtual and augmented reality. Computer vision is effective but requires one or multiple cameras and can struggle with occlusions, limited field of view, and poor lighting. Wearable wrist-based surface electromyography (sEMG) presents a promising alternative as an always-available modality sensing muscle activities that drive hand motion. However, sEMG signals are strongly dependent on user anatomy and sensor placement; existing sEMG models have thus required hundreds of users and device placements to effectively generalize for tasks other than pose inference. To facilitate progress on sEMG pose inference, we introduce the emg2pose benchmark, which is to our knowledge the first publicly available dataset of high-quality hand pose labels and wrist sEMG recordings.
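
As a rough illustration of the sEMG-to-pose regression problem the benchmark targets, the sketch below takes synthetic multi-channel sEMG windows, extracts simple per-channel features, and fits a ridge regression to joint angles. The channel count, window length, 20-DoF target, and feature choices are assumptions for illustration only, not the emg2pose data format or baselines.

```python
# Minimal sketch of the sEMG -> hand-pose regression setup described above.
# Channel count, window length, and the 20-DoF joint-angle target are
# assumptions for illustration, not the emg2pose specification.
import numpy as np

rng = np.random.default_rng(0)
n_windows, n_channels, win_len, n_joints = 256, 16, 200, 20

# Pretend sEMG windows (e.g. 16 wrist electrodes, 200 samples) and joint angles.
emg = rng.standard_normal((n_windows, n_channels, win_len))
angles = rng.standard_normal((n_windows, n_joints))

# Simple per-channel features (RMS and mean absolute value) per window.
rms = np.sqrt((emg ** 2).mean(axis=2))
mav = np.abs(emg).mean(axis=2)
features = np.concatenate([rms, mav], axis=1)          # (n_windows, 32)

# Ridge regression from features to joint angles as a weak baseline.
lam = 1e-2
X = np.hstack([features, np.ones((n_windows, 1))])     # add bias term
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ angles)
pred = X @ W
print("train MAE (synthetic data):", np.abs(pred - angles).mean())
```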


SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Neural Information Processing Systems

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that enables LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with pass@1 rates of 12.5% and 87.7%, respectively, far exceeding the previous state of the art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.
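
The sketch below illustrates the general idea of an agent-computer interface: a handful of commands that act on files and tests and return short, LM-friendly observations. The command names and behaviors are hypothetical and simplified; they are not SWE-agent's actual command set.

```python
# Conceptual sketch of an agent-computer interface (ACI): a small set of
# commands that return concise, LM-friendly observations. The command names
# and behavior here are illustrative, not SWE-agent's actual interface.
import subprocess
from pathlib import Path


def open_file(path: str, start: int = 1, window: int = 20) -> str:
    """Show a numbered window of a file instead of dumping the whole thing."""
    lines = Path(path).read_text().splitlines()
    view = lines[start - 1:start - 1 + window]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(view))


def edit_lines(path: str, start: int, end: int, replacement: str) -> str:
    """Replace lines [start, end] and report a short confirmation."""
    p = Path(path)
    lines = p.read_text().splitlines()
    lines[start - 1:end] = replacement.splitlines()
    p.write_text("\n".join(lines) + "\n")
    return f"Edited {path}: lines {start}-{end} replaced."


def run_tests(cmd: str = "pytest -q") -> str:
    """Run the test suite and return truncated output as the observation."""
    out = subprocess.run(cmd.split(), capture_output=True, text=True)
    return (out.stdout + out.stderr)[-2000:]  # keep the observation short


# An agent loop would alternate: the LM proposes a command, the ACI executes
# it, and the textual observation is appended to the LM's context.
```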


Our favorite budget video doorbell gets an upgrade - see what's new with Amazon's Blink

ZDNet

ZDNET's pick for the best budget-friendly video doorbell is now getting an upgrade, as Amazon has announced the launch of a new generation of the Blink Video Doorbell. The second-generation Blink Video Doorbell is being released four years after the first; it comes alongside the new Blink Sync Module Core and, unfortunately, sports a price increase. The new Blink Video Doorbell is available now for $70, a $20 price increase compared to the first model, which cost $50 when it was released. However, the higher price brings some upgrades. The new Blink Video Doorbell has a 150-degree field of view, which gives you a bigger picture of people from head to toe and any packages that may be on your porch.


New Blink Video Doorbell promises 2 years of runtime on 3 AA batteries

PCWorld

Amazon's Blink division--the budget-oriented sibling of Ring in the smart home space--has announced an all-new version of its affordable battery-powered Blink Video Doorbell, which is available now for $69.99. While the price has gone up by $10 compared to the previous model, the new version's higher-resolution camera (1440 x 1440 pixels), 150-degree field of view, and 1:1 aspect ratio will provide a head-to-toe view of visitors. Blink is bundling the new doorbell with its new AC-powered Blink Sync Module Core, which acts as a bridge to the home's Wi-Fi network (it does not have a chime that sounds when a visitor rings the doorbell). Blink says having the doorbell connect to the Core, instead of directly to a home's router, is what enables the doorbell's remarkable two-year battery life. Since the Sync Module Core does not offer any means of local storage--unlike the Sync Module 2 (with a USB-A socket) and the Sync Module XR (with a microSD card slot)--buyers of the latest Blink Video Doorbell will need to either sign up for a subscription to store video recordings in the cloud or replace the Core with one of Blink's other two Sync Modules.


MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

Neural Information Processing Systems

Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging applications across fields such as film, robotics, and virtual reality. Recent advancements have utilized diffusion models to improve gesture synthesis, but the high computational complexity of these techniques limits their practical application. In this study, we explore the potential of state space models (SSMs). Directly applying SSMs to gesture synthesis encounters difficulties, stemming primarily from the diverse movement dynamics of different body parts; the generated gestures may also exhibit unnatural jittering. To address these issues, we implement a two-stage modeling strategy with discrete motion priors to enhance the quality of gestures. Built upon the selective scan mechanism, we introduce MambaTalk, which integrates hybrid fusion modules and local and global scans to refine latent space representations. Subjective and objective experiments demonstrate that our method surpasses the performance of state-of-the-art models.
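
For intuition, the toy example below implements the core recurrence behind selective SSMs: the state decay and input write-in are functions of the input itself. The shapes, gating functions, and readout are simplified placeholders and do not reflect MambaTalk's actual architecture.

```python
# Toy selective-scan recurrence (the core idea behind selective SSMs such as
# Mamba): the state transition and input gates depend on the input itself.
# Shapes and gating functions are simplified for illustration; this is not
# MambaTalk's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_state = 64, 8, 16          # sequence length, input dim, state dim

x = rng.standard_normal((T, d_in))    # e.g. a latent motion sequence
W_a = rng.standard_normal((d_in, d_state)) * 0.1
W_b = rng.standard_normal((d_in, d_state)) * 0.1
C = rng.standard_normal((d_state, d_in)) * 0.1

h = np.zeros(d_state)
y = np.zeros_like(x)
for t in range(T):
    a_t = 1.0 / (1.0 + np.exp(-(x[t] @ W_a)))   # input-dependent decay in (0, 1)
    b_t = x[t] @ W_b                             # input-dependent write-in
    h = a_t * h + b_t                            # selective linear recurrence
    y[t] = h @ C                                 # per-step readout

print(y.shape)  # (64, 8): one refined latent per frame
```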


EEVR: A Dataset of Paired Physiological Signals and Textual Descriptions for Joint Emotion Representation Learning

Neural Information Processing Systems

Figure 2: Still images extracted from the 360° videos used in the experiment to display various environments to the participants; the videos were selected from the publicly available 360° VR video dataset of Li et al. (2017). The EEVR dataset comprises synchronized pairs of physiological signals and textual data. It includes responses to four self-assessment questions regarding perceived arousal, valence, dominance, and discrete emotion ratings collected using PANAS questionnaires (which were further used to calculate Positive and Negative Affect Scores). The EEVR dataset was collected using 360° virtual reality (VR) videos as the elicitation medium. The videos were selected based on their arousal and valence ratings to cover all four quadrants of the Russell circumplex emotion model (Russell et al., 1989), as shown in Figure 2. The remainder of the supplementary materials provides detailed information about the EEVR dataset. Figure 3 provides a datasheet for the EEVR dataset based on Gebru et al. (2018).


EEVR: A Dataset of Paired Physiological Signals and Textual Descriptions for Joint Emotion Representation Learning

Neural Information Processing Systems

EEVR (Emotion Elicitation in Virtual Reality) is a novel dataset specifically designed for language-supervision-based pre-training for emotion recognition tasks, such as valence and arousal classification. It features high-quality physiological signals, including electrodermal activity (EDA) and photoplethysmography (PPG), acquired through emotion elicitation via 360-degree virtual reality (VR) videos. Additionally, it includes subject-wise textual descriptions of the emotions experienced during each stimulus, gathered from qualitative interviews. The dataset consists of recordings from 37 participants and is the first dataset to pair raw text with physiological signals, providing additional contextual information that objective labels cannot offer. To leverage this dataset, we introduce the Contrastive Language Signal Pre-training (CLSP) method, which jointly learns representations using pairs of physiological signals and textual descriptions. Our results show that integrating self-reported textual descriptions with physiological signals significantly improves performance on emotion recognition tasks such as arousal and valence classification. Moreover, our pre-trained CLSP model demonstrates strong zero-shot transferability to existing datasets, outperforming supervised baseline models and suggesting that the representations learned by our method are more contextualized and generalizable. The release also includes baseline models for arousal, valence, and emotion classification, as well as code for data cleaning and feature extraction.
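
The sketch below shows the general shape of a CLIP-style symmetric contrastive objective between signal and text embeddings, which is the kind of joint representation learning described above. The encoder outputs are random placeholders and the temperature value is an assumption; this is not the paper's CLSP implementation.

```python
# Sketch of a CLIP-style contrastive objective between physiological-signal
# embeddings and text embeddings. Encoder outputs are random placeholders;
# this is not the paper's CLSP implementation.
import numpy as np

rng = np.random.default_rng(0)
batch, dim, temperature = 8, 32, 0.07

signal_emb = rng.standard_normal((batch, dim))   # e.g. EDA/PPG encoder output
text_emb = rng.standard_normal((batch, dim))     # e.g. sentence encoder output

# L2-normalize and compute the pairwise similarity matrix.
s = signal_emb / np.linalg.norm(signal_emb, axis=1, keepdims=True)
t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
logits = (s @ t.T) / temperature

def cross_entropy(logits, targets):
    # Row-wise log-softmax followed by negative log-likelihood of the target.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

targets = np.arange(batch)                       # matched pairs lie on the diagonal
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
print("symmetric contrastive loss:", loss)
```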


LESS: Label-Efficient and Single-Stage Referring 3D Segmentation

Neural Information Processing Systems

Referring 3D Segmentation is a visual-language task that segments all points of the object specified by a query sentence from a 3D point cloud. Previous works follow a two-stage paradigm: first conducting language-agnostic instance segmentation, then matching the instances with the given text query. However, the semantic concepts from the text query and the visual cues interact only separately during training, and both instance and semantic labels are required for each object, which is time-consuming and labor-intensive. To mitigate these issues, we propose a novel Referring 3D Segmentation pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
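
To make the supervision signal concrete, the sketch below scores every point of a point cloud against a single binary mask of the referred object with per-point binary cross-entropy. The point count and the per-point logits are placeholders; this is not the LESS model or its exact loss.

```python
# Sketch of binary-mask supervision for referring segmentation: one binary
# label per point marking the referred object, trained with per-point binary
# cross-entropy. The logits are placeholders, not the LESS architecture.
import numpy as np

rng = np.random.default_rng(0)
n_points = 4096

# Binary label per point: 1 if the point belongs to the referred object.
mask = (rng.random(n_points) < 0.1).astype(np.float64)

# Pretend per-point logits from a text-conditioned point-cloud model.
logits = rng.standard_normal(n_points)
prob = 1.0 / (1.0 + np.exp(-logits))

eps = 1e-7
bce = -(mask * np.log(prob + eps) + (1 - mask) * np.log(1 - prob + eps)).mean()
print("per-point binary mask loss:", bce)
```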


ProEdit: Simple Progression is All You Need for High-Quality 3D Scene Editing

Neural Information Processing Systems

This paper proposes ProEdit, a simple yet effective framework for high-quality 3D scene editing guided by diffusion distillation in a novel progressive manner. Inspired by the key observation that multi-view inconsistency in scene editing is rooted in the diffusion model's large feasible output space (FOS), our framework controls the size of the FOS and reduces inconsistency by decomposing the overall editing task into several subtasks, which are then executed progressively on the scene.
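
As a purely conceptual sketch of the progressive idea, the loop below splits one overall edit into K subtasks of increasing strength and optimizes toward each intermediate target before moving on. The function `edit_step` is a stand-in for one round of diffusion-distillation guidance and does not reflect ProEdit's implementation.

```python
# Conceptual sketch of progressive decomposition: the overall edit is split
# into K subtasks of increasing strength, and the scene is optimized toward
# each intermediate target before moving to the next. `edit_step` is purely
# illustrative.
import numpy as np

def edit_step(scene, target_strength, lr=0.1):
    # Placeholder update nudging scene parameters toward the subtask target.
    return scene + lr * (target_strength - scene)

scene = np.zeros(3)            # stand-in for editable scene parameters
K = 4                          # number of progressive subtasks
for k in range(1, K + 1):
    subtask_strength = k / K   # 25%, 50%, 75%, 100% of the full edit
    for _ in range(20):        # optimize until this subtask is (roughly) done
        scene = edit_step(scene, subtask_strength)
    print(f"after subtask {k}: {scene.round(3)}")
```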